Medical Foundation Model

MedMO: Grounding and Understanding
Multimodal LLMs for Medical Images

A powerful open-source medical foundation model that unifies visual grounding, clinical reasoning, and language understanding across diverse medical imaging modalities.

Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan, Imran Razzak

Mohamed bin Zayed University of Artificial Intelligence

26M+
Training Samples
45
Medical Datasets
+13.8%
VQA Improvement
+43.8
Grounding IoU Gain
4
Training Stages

Abstract

Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data.

MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios.

MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks, achieving state-of-the-art performance in medical VQA, report generation, and diagnostic reasoning. Evaluations across radiology, ophthalmology, pathology, and emergency care confirm MedMO's broad cross-modality generalization and reliable spatial reasoning.

Performance Comparison

MedMO achieves state-of-the-art results across diverse medical imaging tasks

MedMO Performance Comparison Chart
Figure: Benchmark performance of MedMO across diverse medical imaging tasks, including VQA, QA, report generation, and grounding. MedMO achieves consistent gains over prior models, with improvements of +0.3% on MMMU-Med, +0.2% on MedXQA, +2.9% on MMLU-Med, +17.7% on MedQA, +5.1% on MIMIC-CXR, and a substantial +43.8 IoU on Bacteria segmentation. The large boost in Bacteria IoU stems from the incorporation of fine-grained grounding supervision and high-resolution microscopy data, highlighting MedMO's enhanced spatial reasoning and localization capabilities.

Motivation & Challenges

Addressing critical limitations in existing medical MLLMs

Reliance on Distilled Data

Most existing models rely on distilled data from proprietary models, which often lack accurate domain grounding for fine-grained clinical reasoning.

Hallucination Risks

Distillation pipelines without structured supervision amplify hallucination risks and inconsistencies in medical outputs.

Narrow Modality Coverage

Current models focus on individual tasks or narrow modality subsets rather than achieving unified, cross-modal generalization.

Multi-Stage Training Pipeline

Progressive post-training for comprehensive medical image understanding

MedMO Training Pipeline and Capabilities
1. General Medical SFT — large-scale training for foundational understanding (18.5M samples, 768×768)
2. High-Resolution SFT — spatial localization and fine-grained grounding (3M samples, 1280×1280)
3. Instruction Tuning — human-style medical instruction following (4.3M samples, multi-task)
4. Reinforcement Learning — GRPO with verifiable rewards (300K samples, BBox IoU)
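In the GRPO stage, each sampled response is scored with a verifiable reward and then standardized within its own group of samples. A minimal sketch of that group-relative advantage computation (the function name and reward values are illustrative, not from the paper):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages (GRPO): standardize each reward
    against the mean/std of its own sampled group, so no learned
    value network is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 sampled answers to one prompt
# (e.g. 1.0 = passed the verifiable check, 0.0 = failed it).
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses above the group mean get positive advantages and are reinforced; those below are suppressed, which is what lets simple verifiable signals (exact answers, box IoU) drive the policy update.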
01 · Cross-Modal Pretraining — align heterogeneous visual encoders with a medical language backbone using the DeepStack fusion mechanism.

02 · Multi-Task Supervision — training spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes.

03 · Verifiable Rewards — a novel bounding-box GIoU reward combined with factuality checks for enhanced spatial grounding.

04 · Scalable Architecture — built upon Qwen3-VL with a modular design enabling future expansion across additional modalities.
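The bounding-box GIoU reward mentioned above can be sketched as follows. The box format `[x1, y1, x2, y2]` and the choice to use raw GIoU (ranging over [-1, 1]) as the reward signal are assumptions for illustration, not the paper's exact recipe:

```python
def giou(a, b):
    """Generalized IoU for boxes [x1, y1, x2, y2]: IoU minus the
    fraction of the smallest enclosing box not covered by the union."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both a and b
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    enclose = cw * ch
    return inter / union - (enclose - union) / enclose

# Perfect overlap scores 1; disjoint boxes go negative, so the
# reward still discriminates predictions whose plain IoU is 0.
r_good = giou([0, 0, 10, 10], [0, 0, 10, 10])  # 1.0
r_far = giou([0, 0, 1, 1], [2, 2, 3, 3])       # negative
```

That non-zero gradient for non-overlapping boxes is the usual reason to prefer GIoU over plain IoU as a verifiable localization reward.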

Benchmark Results

State-of-the-art performance across medical VQA, Text QA, and Grounding tasks

Key Performance Insights

BASELINE (Qwen3-VL-8B)
VQA: 48.8% · Text QA: 54.5%
CURRENT SOTA (Fleming-VL-8B)
Best VQA Avg: 64.4% · Text QA: 46.9%
MEDMO-4B (OURS)
VQA: 52.0% · Text QA: 56.2%
🏆 MEDMO-8B (OURS) — NEW SOTA
VQA: 62.6% · Text QA: 61.5%

MedMO-8B achieves the best overall balance among open-source models, outperforming both Lingshu-7B and Fleming-VL-8B with the strongest Text-QA results (+14.6% over Fleming-VL) while maintaining competitive VQA performance within 1.8% of SOTA.

Medical VQA Benchmarks

| Model | MMMU-Med | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 75.2 | 65.0 | 72.2 | 55.5 | 55.2 | 75.5 | 45.2 | 63.4 |
| Claude Sonnet 4 | 74.6 | 67.6 | 70.6 | 54.2 | 54.4 | 65.5 | 43.3 | 61.5 |
| Gemini-2.5-Flash | 76.9 | 68.5 | 75.8 | 55.4 | 55.4 | 71.0 | 52.8 | 65.1 |
| Fleming-VL-8B | 63.3 | 66.1 | 86.5 | 62.9 | 64.3 | 86.7 | 21.6 | 64.4 |
| Lingshu-7B | 54.0 | 67.9 | 83.1 | 61.9 | 56.3 | 82.9 | 26.7 | 61.8 |
| Qwen3VL-8B (Baseline) | 61.4 | 64.1 | 47.3 | 14.6 | 52.3 | 77.2 | 24.8 | 48.8 |
| MedMO-4B (Ours) | 54.6 | 50.9 | 41.0 | 62.4 | 50.6 | 79.7 | 24.8 | 52.0 (+3.2) |
| MedMO-8B (Ours) | 64.6 | 64.7 | 81.6 | 56.3 | 59.4 | 84.8 | 26.9 | 62.6 (+13.8) |

Medical Text QA Benchmarks

| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets | MedXQA | SGPQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4.1 | 89.6 | 75.6 | 77.7 | 89.1 | 77.0 | 30.9 | 49.9 | 70.0 |
| Claude Sonnet 4 | 91.3 | 78.6 | 79.3 | 92.1 | 80.2 | 33.6 | 56.3 | 73.1 |
| Gemini-2.5-Flash | 84.2 | 73.8 | 73.6 | 91.2 | 77.6 | 35.6 | 53.3 | 69.9 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5 | 12.1 | 24.9 | 46.9 |
| Lingshu-7B | 74.5 | 76.6 | 55.9 | 63.3 | 56.2 | 16.5 | 26.3 | 52.8 |
| Qwen3VL-8B (Baseline) | 79.3 | 70.4 | 60.0 | 66.1 | 56.1 | 15.1 | 34.7 | 54.5 |
| MedMO-4B (Ours) | 75.7 | 78.0 | 58.0 | 78.5 | 57.5 | 16.4 | 29.4 | 56.2 (+1.7) |
| MedMO-8B (Ours) | 82.2 | 76.8 | 65.0 | 83.8 | 65.2 | 20.4 | 37.2 | 61.5 (+7.0) |

Medical Report Generation

Semantic (ROUGE-L, CIDEr) and model-based (RaTE, Semb) metrics

Each cell lists R-L / CIDEr / RaTE / Semb; "—" marks results not reported.

| Model | MIMIC-CXR | CheXpert Plus | IU-Xray | Med-Trinity |
| --- | --- | --- | --- | --- |
| GPT-4.1 | 9.0 / 82.8 / 51.3 / 23.9 | 24.5 / 78.8 / 45.5 / 23.2 | 30.2 / 124.6 / 51.3 / 47.5 | — |
| Gemini-2.5-Flash | 25.4 / 80.7 / 50.3 / 29.7 | 23.6 / 72.2 / 44.3 / 27.4 | 33.5 / 129.3 / 55.6 / 50.9 | — |
| Lingshu-7B | 30.8 / 109.4 / 52.1 / 30.0 | 26.5 / 79.0 / 45.4 / 26.8 | 41.2 / 180.7 / 57.6 / 48.4 | 16.0 / 74.5 / 44.4 / 24.0 |
| Fleming-VL-8B | 35.7 / 132.5 / 56.7 / 33.6 | 26.1 / 82.2 / 47.1 / 40.1 | 44.9 / 198.6 / 66.0 / 51.3 | 13.1 / 35.8 / 41.9 / 18.1 |
| Qwen3VL-8B (Baseline) | 25.1 / 77.9 / 50.3 / 33.4 | 21.9 / 67.4 / 44.4 / 37.9 | 25.0 / 91.4 / 52.5 / 42.9 | 20.2 / 69.9 / 45.9 / 33.6 |
| MedMO-4B (Ours) | 26.0 / 92.6 / 49.8 / 31.6 | 15.1 / 62.3 / 36.6 / 34.2 | 26.6 / 94.0 / 42.1 / 41.3 | 22.5 / 152.6 / 47.8 / 34.3 |
| MedMO-8B (Ours) | 31.7 / 140.0 / 57.1 / 50.0 | 23.6 / 87.5 / 47.3 / 42.2 | 31.1 / 169.7 / 45.3 / 41.3 | 37.0 / 270.4 / 53.0 / 39.2 |
Key Result: MedMO-8B achieves CIDEr 140.0 and Semb 50.0 on MIMIC-CXR, the best semantic coherence and clinical accuracy among the compared models. On Med-Trinity (diverse modalities), MedMO-8B leads by a wide margin with CIDEr 270.4, well ahead of the next-best result.
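ROUGE-L, one of the semantic metrics in the table, is the longest-common-subsequence F-measure between a generated and a reference report. A minimal sketch (plain whitespace tokenization is a simplification of the usual preprocessing):

```python
def rouge_l(reference: str, hypothesis: str) -> float:
    """ROUGE-L F1: LCS overlap between reference and hypothesis tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table for longest common subsequence length
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r_ = lcs / len(hyp), lcs / len(ref)
    return 2 * p * r_ / (p + r_)  # harmonic mean of precision and recall

score = rouge_l("no acute cardiopulmonary abnormality",
                "no acute abnormality")
```

Because it rewards in-order token overlap rather than exact n-grams, ROUGE-L tolerates the phrasing variation common in radiology reports, which is why it is paired here with model-based metrics (RaTE, Semb) that check clinical meaning.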

Medical Grounding Benchmarks (IoU %)

| Model | NIH Chest | DeepLesion | Bacteria | MedSG (multi-view) | MedSG (tracking) | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| InternVL3-8B | 10.1 | 0.0 | 0.7 | 6.3 | 13.0 | 5.6 |
| Fleming-VL-8B | 0.0 | 0.0 | 8.3 | 42.0 | 36.7 | 17.2 |
| Lingshu-7B | 5.3 | 0.7 | 0.0 | 28.3 | 38.7 | 13.9 |
| Qwen3VL-8B | 16.4 | 0.0 | 9.16 | 8.4 | 17.8 | 13.8 |
| MedMO-8B (Ours) | 8.83 | 38.5 | 54.6 | 75.8 | 77.2 | 54.2 (+40.4) |
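Benchmark scores of this kind are typically the mean IoU between predicted and ground-truth boxes over a dataset. A minimal evaluation sketch, with hypothetical boxes in `[x1, y1, x2, y2]` format (not taken from any of the benchmarks above):

```python
def iou(a, b):
    """Intersection-over-union for boxes [x1, y1, x2, y2]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def mean_iou(preds, gts):
    """Dataset-level score: average per-sample IoU."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(gts)

# Hypothetical predictions vs. ground truth
preds = [[10, 10, 50, 50], [0, 0, 100, 100]]
gts   = [[10, 10, 50, 50], [0, 0, 50, 100]]
score = mean_iou(preds, gts) * 100  # reported as IoU %
```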

Key Contributions

01 · Open-Source Foundation Model — a powerful open-source, post-trained multimodal VLM for comprehensive medical image understanding and grounding, available in 4B and 8B variants.

02 · Scalable Training Pipeline — 26M+ curated multimodal medical samples from 45 datasets, with a multi-stage post-training pipeline that progressively enhances cross-modal alignment.

03 · Novel Evaluation Benchmark — a dedicated Cell dataset constructed from open-source microscopy images with varying sizes, shapes, and densities for evaluating VLM detection capabilities.

04 · Comprehensive Analysis — extensive experiments across data and methodology dimensions, providing an open benchmark for future multimodal medical LLM research.

Unified Multimodal Medical Dataset

Dataset composition covering imaging modalities and biological systems
Composition of the unified multimodal medical dataset comprising diverse imaging modalities (X-ray, CT, MRI, Ultrasound, Nuclear Medicine, Optical, Pathology) and biological systems (Respiratory, Cardiovascular, Nervous, Digestive, Urinary, Musculoskeletal, and more).

Qualitative Results

MedMO demonstrates superior diagnostic accuracy and clinical reasoning

🔬 Dermatology Diagnosis
Question
What is the name of the skin abnormality in this image?
Options: A. Eczema, B. Squamous cell carcinoma, C. Malignant melanoma, D. Melanoma
Other Models (Fleming-VL, Qwen3-VL, Lingshu)
B. Psoriasis ❌
✓ MedMO
B. Squamous cell carcinoma
🦠 Cell Detection & Grounding
Question
Detect and localize all cells in the image.
Ground Truth
[[54,545,63,554]]
Other Models
Fleming-VL: [0,0,999,999] ❌
Qwen3-VL: [31,21,965,957] ❌
✓ MedMO
[53,548,62,557] ✓ (Near-perfect localization)